50 research outputs found

    Prosody Modification using Allpass Residual of Speech Signals

    In this paper, we demonstrate the role of the phase spectrum of speech signals in obtaining an accurate estimate of the excitation source for prosody modification. The phase spectrum is parametrically modeled as the response of an allpass (AP) filter, and the filter coefficients are estimated by treating the linear prediction (LP) residual as the output of the AP filter. The resultant residual signal, called the AP residual, exhibits unambiguous peaks corresponding to epochs, which are chosen as pitch markers for prosody modification. This strategy efficiently removes the ambiguities associated with pitch marking required for the pitch synchronous overlap-add (PSOLA) method. Prosody modification using the AP residual is more advantageous than time-domain PSOLA (TD-PSOLA) applied to speech signals, as its flat magnitude spectrum leads to fewer distortions. Windows centered on the unambiguous peaks in the AP residual are used for segmentation, followed by pitch/duration modification of the AP residual through mapping of pitch markers. The modified speech signal is obtained from the modified AP residual using synthesis filters. Mean opinion scores are used for performance evaluation of the proposed method, and the AP residual-based method is observed to deliver performance equivalent to that of the LP residual-based method using epochs, and better performance than linear prediction PSOLA (LP-PSOLA).
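    As a rough illustration of pitch-synchronous overlap-add on a residual signal, the sketch below performs duration modification of a synthetic impulse-train residual using known pitch markers. It assumes a constant pitch period and Hann windows two periods wide; it is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def duration_modify(residual, marks, alpha):
    """Duration modification by overlap-add of Hann windows centered
    on pitch markers. Target markers keep the original pitch period
    (so pitch is preserved); each maps back to the nearest source
    marker after time scaling by alpha."""
    marks = np.asarray(marks)
    T = int(np.median(np.diff(marks)))          # pitch period, kept unchanged
    out = np.zeros(int(len(residual) * alpha) + T)
    for tm in range(int(marks[0]), len(out) - T, T):
        # nearest source marker for this target instant
        sm = int(marks[np.argmin(np.abs(marks - tm / alpha))])
        lo, hi = max(0, sm - T), min(len(residual), sm + T)
        seg = residual[lo:hi] * np.hanning(hi - lo)   # two-period window
        start = tm - (sm - lo)
        if 0 <= start and start + len(seg) <= len(out):
            out[start:start + len(seg)] += seg
    return out
```

    On an impulse-train residual, the output is roughly alpha times longer while the spacing between impulses (the pitch period) is preserved.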

    Experimental studies on effect of speaking mode on spoken term detection

    The objective of this paper is to study the effect of speaking mode on a spoken term detection (STD) system. Experiments are conducted with query words recorded in isolation and with words cut out from continuous speech. The durations of phonemes in query words vary greatly between these two modes, so the pattern-matching stage, which accounts for temporal variations, plays a crucial role. Matching is done using subsequence dynamic time warping (DTW) on posterior features of the query and reference utterances, obtained by training a multilayer perceptron (MLP). The difference in STD performance for different phoneme groupings (45, 25, 15, and 6 classes) is also analyzed. Our STD system is tested on Telugu broadcast news. A major difference in performance is observed between recorded and cut-out query words: the system performs better with query words cut out from continuous speech than with words recorded in isolation. This difference can be attributed to the large temporal variations.
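    The subsequence DTW matching stage can be sketched as follows: the whole query is aligned against any contiguous region of the reference (free start and end on the reference axis). Euclidean distance between frame vectors stands in here for whatever local cost the actual system uses on MLP posteriors; this is an illustrative sketch, not the paper's code.

```python
import numpy as np

def subsequence_dtw(query, ref):
    """Align the full query against the best-matching contiguous
    region of the reference. Returns (alignment cost, end index
    of the match in the reference)."""
    Q, R = len(query), len(ref)
    D = np.full((Q, R), np.inf)
    # free start: the query may begin anywhere in the reference
    for j in range(R):
        D[0, j] = np.linalg.norm(query[0] - ref[j])
    for i in range(1, Q):
        for j in range(R):
            c = np.linalg.norm(query[i] - ref[j])
            best = D[i - 1, j]
            if j > 0:
                best = min(best, D[i - 1, j - 1], D[i, j - 1])
            D[i, j] = c + best
    end = int(np.argmin(D[-1]))   # free end in the reference
    return D[-1, end], end
```

    With one-hot "posterior" frames, a query embedded exactly inside the reference yields zero cost at the embedding's end index.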

    Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications

    Zero resource speech processing refers to a scenario where no or minimal transcribed data is available. In this paper, we propose a three-step unsupervised approach to zero resource speech processing that does not require any other information or dataset. In the first step, we segment the speech signal into phoneme-like units, resulting in a large number of varying-length segments. The second step involves clustering the varying-length segments into a finite number of clusters so that each segment can be labeled with a cluster index. The unsupervised transcriptions thus obtained can be thought of as a sequence of virtual phone labels. In the third step, a deep neural network (DNN) classifier is trained to map the feature vectors extracted from the signal to their corresponding virtual phone labels. The virtual phone posteriors extracted from the DNN are used as features in zero resource speech processing. The effectiveness of the proposed approach is evaluated on both the ABX and spoken term discovery (STD) tasks using the spontaneous American English and Tsonga language datasets provided as part of the Zero Resource Speech Challenge 2015. The proposed system outperforms the baselines supplied with the datasets on both tasks, without any task-specific modification.
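    A toy version of the second step (labeling variable-length segments with cluster indices) might look like the following. Each segment is summarized by its mean feature vector before k-means; this is a simplification of clustering the variable-length segments directly, and the deterministic farthest-point initialization is an assumption for reproducibility.

```python
import numpy as np

def label_segments(segments, k, iters=20):
    """Assign a 'virtual phone' cluster index to each variable-length
    segment (a list of (frames, dim) arrays)."""
    # summarize each segment by its mean feature vector
    X = np.stack([np.asarray(s).mean(axis=0) for s in segments])
    # deterministic farthest-point initialization of k centers
    C = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[int(np.argmax(d))])
    C = np.stack(C)
    # plain Lloyd iterations
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels
```

    Segments drawn from well-separated distributions receive consistent labels, which then serve as targets for the step-3 DNN classifier.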

    Action-vectors: Unsupervised movement modeling for action recognition

    Representation and modelling of movements play a significant role in recognising actions in unconstrained videos. However, explicit segmentation and labelling of movements are non-trivial because of the variability associated with actors, camera viewpoints, duration, etc. Therefore, we propose to train a GMM with a large number of components, termed a universal movement model (UMM). The UMM is trained using motion boundary histograms (MBH), which capture the motion trajectories associated with movements across all possible actions. For a particular action video, the MAP-adapted mean vectors of the UMM are concatenated to form a fixed-dimensional representation referred to as a 'super movement vector' (SMV). However, the SMV is still high dimensional, and hence Baum-Welch statistics extracted from the UMM are used to arrive at a compact representation for each action video, which we refer to as an 'action-vector'. It is shown that, even without the use of class labels, action-vectors provide a more discriminative representation of action classes, translating to an 8% relative improvement in classification accuracy for action-vectors based on MBH features over naïve MBH features on the UCF101 dataset. Furthermore, action-vectors projected with LDA achieve 93% accuracy on the UCF101 dataset, which rivals state-of-the-art deep learning techniques.
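    The construction of the SMV can be sketched as relevance-MAP adaptation of diagonal-covariance GMM means to one video's descriptors, followed by concatenation. The relevance factor r = 16 and the toy dimensions are assumptions for illustration, not values from the paper.

```python
import numpy as np

def map_adapted_supervector(X, means, covs, weights, r=16.0):
    """X: (N, D) descriptors of one video; means/covs: (K, D) of the
    UMM; weights: (K,). Returns the K*D supervector of MAP-adapted
    component means."""
    K, D = means.shape
    # log-likelihoods per component under a diagonal-covariance GMM
    logp = np.stack([
        -0.5 * (((X - means[k]) ** 2 / covs[k]).sum(1)
                + np.log(covs[k]).sum()) + np.log(weights[k])
        for k in range(K)], axis=1)
    logp -= logp.max(1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(1, keepdims=True)           # responsibilities
    n = post.sum(0)                              # zeroth-order stats
    f = post.T @ X                               # first-order stats
    alpha = n / (n + r)                          # data-dependent mixing
    adapted = (alpha[:, None] * (f / np.maximum(n, 1e-8)[:, None])
               + (1 - alpha[:, None]) * means)
    return adapted.reshape(-1)
```

    Components that see data move toward it; unseen components keep their UMM means, so the supervector encodes which movements occur in the video.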

    Feature selection using Deep Neural Networks

    Feature descriptors used in video processing are generally high dimensional. Even though the extracted features are high dimensional, the task at hand often depends only on a small subset of them. For example, to distinguish actions such as running and walking, features related to the person's leg movement suffice. Since this subset is not known a priori, all features tend to be used, irrespective of the complexity of the task at hand. Selecting task-aware features may improve not only the efficiency but also the accuracy of the system. In this work, we propose a supervised approach for task-aware selection of features using Deep Neural Networks (DNN) in the context of action recognition. The activation potentials contributed by each of the individual input dimensions at the first hidden layer are used for selecting the most appropriate features. The selected features are found to give better classification performance than the original high-dimensional features. It is also shown that the classification performance of the proposed feature selection technique is superior to that of the low-dimensional representation obtained by principal component analysis (PCA).
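    One plausible reading of the selection rule, scoring each input dimension by the absolute activation potential it contributes at the first hidden layer, is sketched below; the paper's exact scoring may differ.

```python
import numpy as np

def select_features(W1, X, top_k):
    """W1: (D, H) first-layer weight matrix of a trained DNN;
    X: (N, D) inputs. Score input dimension d by the total
    |x_d * W1[d, h]| it contributes across hidden units and
    examples, and keep the top_k dimensions."""
    contrib = np.abs(X[:, :, None] * W1[None, :, :])   # (N, D, H)
    scores = contrib.sum(axis=(0, 2))                  # per input dim
    return np.argsort(scores)[::-1][:top_k]
```

    Dimensions with large weights into the hidden layer (i.e., those the trained network actually uses) receive high scores and are retained.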

    Novel speech duration modifier for packet based communication system

    In this paper, we propose a real-time method for duration modification of speech in packet-based communication systems. While there is a rich literature on duration modification, it does not clearly address the issues of real-time implementation. Most duration modification methods rely on accurate estimation of pitch marks, which is not feasible in a real-time scenario. The proposed method modifies the duration of the linear prediction residual of individual frames without any look-ahead delay or knowledge of pitch marks. In this method, multiples of the pitch period are repeated in, or removed from, a frame depending on a scheduling algorithm. The subjective quality of the proposed method was found to be better than that of the waveform similarity overlap-add (WSOLA) technique as well as the linear prediction pitch synchronous overlap-add (LP-PSOLA) technique.
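    The per-frame repeat/remove operation can be illustrated as follows, with the scheduling collapsed to a single rounding rule per frame; the actual scheduling algorithm in the paper is more elaborate, and the period is assumed known here.

```python
def modify_frame(frame, period, alpha):
    """Stretch (alpha > 1) or compress (alpha < 1) one residual frame
    by repeating or removing whole multiples of the pitch period,
    without pitch marks or look-ahead. frame is a list of samples."""
    n = len(frame)
    # number of whole pitch periods to insert (+) or delete (-)
    extra = int(round((alpha - 1.0) * n / period))
    if extra >= 0:
        return frame + frame[-period:] * extra   # repeat trailing period
    keep = n + extra * period
    return frame[:max(keep, period)]             # drop trailing periods
```

    Operating on whole pitch periods keeps the waveform locally periodic, which is what lets the method avoid explicit pitch marking.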

    IITG-Indigo System for NIST 2016 SRE Challenge

    This paper describes the speaker verification (SV) system submitted to the NIST 2016 speaker recognition evaluation (SRE) challenge by the Indian Institute of Technology Guwahati (IITG) under the fixed training condition task. Various SV systems were developed in idea-level collaboration with two other Indian institutions. Unlike previous SREs, this time the focus was on developing an SV system using non-target-language speech data and a small amount of unlabeled data from the target language/dialects. To address these novel challenges, we explored the fusion of systems created using different features, data conditioning, and classifiers. On the NIST 2016 SRE evaluation data, the presented fused system achieved an actual detection cost function (actDCF) of 0.81 and an equal error rate (EER) of 12.91%. Post-evaluation, we explored a recently proposed pairwise support vector machine classifier and applied adaptive S-norm to the decision scores before fusion. With these changes, the final system achieves an actDCF of 0.67 and an EER of 11.63%.
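    Adaptive S-norm, as applied to the decision scores here, can be sketched as normalizing each trial score by statistics of the top-N cohort scores on the enrollment and test sides; the top-N selection is the "adaptive" part, and N and the cohort values below are illustrative.

```python
import numpy as np

def adaptive_snorm(score, enroll_cohort, test_cohort, top_n=3):
    """Symmetric score normalization using only the top-N (most
    similar) cohort scores of each side of the trial."""
    e = np.sort(np.asarray(enroll_cohort))[::-1][:top_n]
    t = np.sort(np.asarray(test_cohort))[::-1][:top_n]
    return 0.5 * ((score - e.mean()) / e.std()
                  + (score - t.mean()) / t.std())
```

    The normalization is order-preserving per trial, but aligning score distributions across trials makes a single decision threshold (and hence fusion) behave better.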

    End to End ASR Free Keyword Spotting with Transfer Learning from Speech Synthesis

    Keyword spotting is an important speech application, but it typically requires as much data as an automatic speech recognition (ASR) system, even though the problem is far more specific than full ASR. This work attempts to reduce the dependency on transcribed data. Traditional keyword spotting (KWS) architectures are built on top of ASR; approaches such as lattice indexing and keyword-filler models are popular. Although they give good accuracy, the former is an offline system and the latter suffers from lower accuracy. Here we propose an improvement to an approach called end-to-end ASR-free keyword spotting. Inspired by traditional KWS architectures, this system consists of three modules: an acoustic encoder, a phonetic encoder, and a keyword neural network. The acoustic encoder processes the speech features into a fixed-length representation; the phonetic encoder likewise produces a fixed-length representation; the two are concatenated to form the input to the keyword network, which predicts whether the keyword is present. We propose to retain all hidden representations, preserving the temporal resolution needed to locate the query, and to pretrain the phonetic encoder to make it aware of the acoustic projection. These changes improve performance by 7.1% absolute. In addition, being end to end, the system is easily deployable.
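    The three-module forward pass can be caricatured as follows, with mean pooling standing in for the recurrent encoders and a logistic layer for the keyword network; all shapes and parameters are illustrative, not the paper's architecture.

```python
import numpy as np

def kws_forward(acoustic_feats, phone_feats, params):
    """acoustic_feats: (Ta, D) speech features of the utterance;
    phone_feats: (Tp, D) features of the query; params holds the
    keyword network's weights. Returns P(keyword present)."""
    a = acoustic_feats.mean(axis=0)        # acoustic encoding (fixed length)
    p = phone_feats.mean(axis=0)           # phonetic encoding (fixed length)
    h = np.concatenate([a, p])             # joint representation
    z = h @ params["W"] + params["b"]
    return 1.0 / (1.0 + np.exp(-z))        # logistic keyword detector
```

    Pooling to a single vector is exactly what discards temporal resolution; the paper's proposal to retain all hidden representations addresses that limitation.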

    Epoch Extraction from Allpass Residual of Speech Signals

    Identification of epochs from speech signals is a prominent task in speech processing. In this paper, epoch extraction is attempted from the phase spectrum of speech signals. The phase spectrum of speech is modelled as an allpass (AP) filter by minimizing the entropy of energy in the associated error signal. The AP residual thus obtained contains prominent, unambiguous peaks at epoch locations. These peaks constitute a set of candidate epoch locations, from which the appropriate ones are identified using a dynamic programming algorithm. The proposed method is evaluated on a subset of the CMU Arctic database, and it delivers better epoch extraction performance than the prominent speech event estimation method DYPSA. For telephone-channel speech, the proposed method also significantly outperforms the zero frequency resonator based method.
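    The dynamic programming selection of epochs from candidate peaks can be sketched with a cost that rewards strong peaks and penalizes deviation of the inter-epoch gap from an expected pitch period; the paper's actual cost function may differ, and `lam` is an assumed trade-off weight.

```python
import numpy as np

def select_epochs(candidates, strengths, expected_T, lam=1.0):
    """candidates: sorted candidate peak locations (samples);
    strengths: peak amplitudes; expected_T: expected pitch period.
    Returns the minimum-cost subsequence of candidates."""
    n = len(candidates)
    cost = [-strengths[0]] + [np.inf] * (n - 1)
    prev = [-1] * n
    for i in range(1, n):
        for j in range(i):
            gap = candidates[i] - candidates[j]
            # reward peak strength, penalize irregular spacing
            c = cost[j] - strengths[i] + lam * abs(gap - expected_T) / expected_T
            if c < cost[i]:
                cost[i], prev[i] = c, j
    i = int(np.argmin(cost))          # best terminal candidate
    path = []
    while i != -1:                    # backtrack
        path.append(candidates[i])
        i = prev[i]
    return path[::-1]
```

    A weak spurious candidate between two true epochs is skipped because including it incurs a spacing penalty without a compensating strength reward.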

    Analysis of Phase Spectrum of Speech Signals Using Allpass Modeling

    The phase spectrum of the Fourier transform has received less prominence than its magnitude counterpart in speech processing. In this paper, we propose a method for parametric modeling of the phase spectrum and discuss its applications in speech signal processing. The phase spectrum is modeled as the response of an allpass (AP) filter, whose coefficients are estimated from knowledge of the speech production process, especially the impulse-like nature of the excitation source. A signal retaining only the phase spectral component of the speech signal is derived by suppressing the magnitude spectral component, and is modeled as the output of an AP filter excited with a sequence of impulses. The entropy of energy of the input signal is minimized to estimate the coefficients of the AP filter. The resulting objective function, being nonconvex, is minimized using particle swarm optimization. The group delay response of the estimated AP filters can be used for accurate analysis of resonances of the vocal-tract system (VTS). The error signal associated with AP modeling provides unambiguous evidence of the instants of significant excitation of the VTS. The applications of the proposed AP modeling include, but are not limited to, formant tracking, extraction of glottal closure instants, speaker verification, and speech synthesis.
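    The group delay response of an estimated AP filter, whose peaks indicate vocal-tract resonances, can be computed numerically as the negative derivative of the unwrapped phase. This is a generic sketch for any allpass H(z) = z^{-p} A(z^{-1}) / A(z), not the paper's estimation procedure.

```python
import numpy as np

def allpass_group_delay(a, n_freq=512):
    """a = [1, a1, ..., ap]: denominator coefficients of the allpass
    filter H(z) = z^{-p} A(z^{-1}) / A(z). Returns frequencies in
    [0, pi] and the group delay (negative phase derivative)."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    w = np.linspace(0.0, np.pi, n_freq)
    z = np.exp(1j * w)
    den = np.polyval(a[::-1], 1.0 / z)        # A(z) = sum_k a_k z^{-k}
    num = z ** (-p) * np.polyval(a[::-1], z)  # z^{-p} A(z^{-1})
    H = num / den                             # |H(e^{jw})| = 1 by construction
    tau = -np.gradient(np.unwrap(np.angle(H)), w)
    return w, tau
```

    For a first-order allpass with a real pole at r = 0.5, the analytic group delay is (1 - r^2) / (1 + r^2 - 2r cos w), i.e. 3 samples at w = 0, which the numerical sketch reproduces.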